158 research outputs found

    Alignement multilingue pour l'Ă©tude contrastive : outils et applications

    No full text
    Multilingual corpora are now easily accessible in many languages and in large quantities. Many tools dedicated to processing these corpora are also available: alignment applications, concordancers, XML markup and annotation tools, sometimes distributed under free or GPL licenses. Traditionally used by professional translators (to build translation memories) or by specialists in natural language processing (corpora of translated texts feed into the construction of some machine translation systems), these tools and language resources remain little exploited in the field of corpus linguistics. We argue instead that they constitute a very interesting resource for the empirical study of linguistic contrasts, at the lexical level as much as at the morphosyntactic one. In the first part of the article, we present the most widely used techniques, in order to assess the current state of automatic alignment. More specifically, we describe the generic methods implemented in the Alinea software, whose results were measured during the recent Arcade 2 evaluation campaign. We review its main functionalities: sentence alignment, extraction of lexical correspondences, concordance search using bilingual queries over texts that may be tagged and lemmatized, and the automatic extraction of a bilingual glossary. Finally, we give two examples of contrastive observation based on these functionalities, which make it possible to search for and compare complex linguistic expressions or constructions not only at the morphosyntactic level but also at the level of semantic structures, following the networks of differences and analogies woven by translational equivalence relations.
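Sentence alignment of the kind implemented in such tools typically rests on length-based dynamic programming in the Gale-Church tradition. The following is a minimal sketch of that family of methods, not Alinea's actual implementation; the cost model and the restricted bead inventory (1-1, 1-2, 2-1) are illustrative simplifications:

```python
import math

def align_cost(src_len, tgt_len):
    """Penalty for pairing source/target chunks, based on character-length
    ratio. A simplification of the Gale-Church length model."""
    if src_len == 0 or tgt_len == 0:
        return 10.0
    return abs(math.log(src_len / tgt_len))

def align_sentences(src, tgt):
    """Dynamic-programming sentence alignment over 1-1 / 1-2 / 2-1 beads."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    moves = [(1, 1), (1, 2), (2, 1)]  # allowed bead shapes
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            for di, dj in moves:
                if i + di <= n and j + dj <= m:
                    s = sum(len(x) for x in src[i:i + di])
                    t = sum(len(x) for x in tgt[j:j + dj])
                    cost = best[i][j] + align_cost(s, t)
                    if cost < best[i + di][j + dj]:
                        best[i + di][j + dj] = cost
                        back[i + di][j + dj] = (di, dj)
    # Trace back the optimal bead sequence
    beads, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((tuple(src[i - di:i]), tuple(tgt[j - dj:j])))
        i, j = i - di, j - dj
    return list(reversed(beads))
```

Real aligners combine such length costs with lexical anchors (cognates, bilingual lexicons) and allow 1-0 / 0-1 beads for omissions.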

    Augmenting Librispeech with French Translations: A Multimodal Corpus for Direct Speech Translation Evaluation

    Full text link
    Recent works in spoken language translation (SLT) have attempted to build end-to-end speech-to-text translation without using source language transcription during learning or decoding. However, while large quantities of parallel texts (such as Europarl or OpenSubtitles) are available for training machine translation systems, there are no large (100h) open-source parallel corpora that include speech in a source language aligned to text in a target language. This paper tries to fill this gap by augmenting an existing (monolingual) corpus: LibriSpeech. This corpus, used for automatic speech recognition, is derived from read audiobooks from the LibriVox project and has been carefully segmented and aligned. After gathering French e-books corresponding to the English audiobooks from LibriSpeech, we align speech segments at the sentence level with their respective translations and obtain 236h of usable parallel data. This paper presents the details of the processing as well as a manual evaluation conducted on a small subset of the corpus. This evaluation shows that the automatic alignment scores are reasonably correlated with human judgments of bilingual alignment quality. We believe that this corpus (which is made available online) is useful for replicable experiments in direct speech translation or, more generally, in spoken language translation. Comment: LREC 2018, Japan
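The evaluation described above checks whether automatic alignment scores track human quality judgments. One standard way to quantify such agreement is a rank correlation; below is a self-contained Spearman sketch (an illustration of the statistic, not the paper's actual evaluation code):

```python
def rankdata(xs):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)
```

Fed with automatic alignment scores on one side and human ratings on the other, a value near 1.0 indicates the automatic scores rank segments much as annotators do.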

    Mastering Overdetection and Underdetection in Learner-Answer Processing: Simple Techniques for Analysis and Diagnosis.

    Get PDF
    This paper presents a "didactic triangulation" strategy to cope with the problem of the reliability of NLP applications in Computer Assisted Language Learning (CALL) systems. It is based on the implementation of basic but well-mastered NLP techniques, and puts the emphasis on an adapted gearing between computable linguistic clues and the didactic features of the evaluated activities. We claim that a correct balance between noise (i.e. false error detection) and silence (i.e. undetected errors) is not only an outcome of NLP techniques, but also of an appropriate didactic integration of what NLP can do well and what it cannot. Based on this approach, ExoGen is a prototype for generating activities such as gap-fill exercises. It integrates a module for error detection and description, which checks learners' answers against expected ones. Through the analysis of graphic, orthographic and morphosyntactic differences, it is able to diagnose problems such as spelling errors, lexical mix-ups, agreement errors, conjugation errors, etc. The first evaluation of ExoGen's output, based on the FRIDA learner corpus, has yielded very promising results, paving the way for the development of an efficient, general model adapted to a wide variety of activities.
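Checking a learner's answer against the expected one by analysing graphic differences can be sketched with a simple edit-distance comparison. The function names, diagnosis categories and threshold below are illustrative, not ExoGen's actual implementation:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (two-row version)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def diagnose(answer, expected):
    """Rough diagnosis of a learner answer against the expected form.
    The threshold of 2 edits is illustrative, not ExoGen's."""
    a, e = answer.strip().lower(), expected.strip().lower()
    if a == e:
        return "correct"
    if edit_distance(a, e) <= 2:
        return "probable spelling error"
    return "lexical or grammatical mix-up"
```

A real system would additionally consult lemma and morphosyntactic information to separate, say, agreement errors from conjugation errors.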

    Multi-Task sequence prediction for Tunisian Arabizi multi-level annotation

    Get PDF
    In this paper we propose a multi-task sequence prediction system, based on recurrent neural networks, used to annotate a Tunisian Arabizi corpus on multiple levels. The annotations performed are text classification, tokenization, PoS tagging, and encoding of Tunisian Arabizi into CODA* Arabic orthography. The system is trained to predict all the annotation levels in cascade, starting from the Arabizi input. We evaluate the system on the TIGER German corpus, suitably converting the data into a multi-task problem, in order to show the effectiveness of our neural architecture. We also show how we used the system to annotate a Tunisian Arabizi corpus, which was afterwards manually corrected and used to further evaluate sequence models on Tunisian data. Our system is developed for the Fairseq framework, which allows fast and easy reuse for any other sequence prediction problem.
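The cascade can be pictured as each annotation level consuming the previous level's output. The toy pipeline below uses invented rule-based stand-ins in place of the trained recurrent networks, purely to illustrate the cascaded data flow (the marker set and tag rules are made up for the example):

```python
def tokenize(text):
    """Level 1: whitespace tokenization (stand-in for the learned tokenizer)."""
    return text.split()

def classify_token(tok):
    """Level 2: per-token class. Toy heuristic: digits such as 3/7/8/9 are
    often used as letters in Arabizi; this lexicon is illustrative only."""
    arabizi_markers = set("3789")
    return "arabizi" if any(c in arabizi_markers for c in tok) else "other"

def pos_tag(tok, cls):
    """Level 3: PoS tag conditioned on the previous level's output.
    The rules here are placeholders, not a real tagger."""
    if cls == "other":
        return "X"
    return "NOUN" if tok.endswith("a") else "VERB"

def annotate(text):
    """Cascade: each annotation level feeds the next, loosely mirroring
    the multi-level prediction described above."""
    out = []
    for tok in tokenize(text):
        cls = classify_token(tok)
        out.append((tok, cls, pos_tag(tok, cls)))
    return out
```

In the actual system each of these stages is a learned sequence-prediction model sharing a common encoder, rather than a hand-written rule.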

    Routines sémantico-rhétoriques dans l’écrit scientifique de sciences humaines : l’apport des arbres lexico-syntaxiques récurrents

    Get PDF
    Scientific writing is characterized by a sociolect with specific linguistic properties, in particular phraseological ones. This study focuses on semantico-rhetorical routines, sometimes called patterns (patrons), turns of phrase (tournures) or motifs, through which writers inscribe themselves in a "discourse community". After defining the linguistic properties of these routines more precisely, we address the methodological aspects of bringing them to light with corpus-linguistics tools. We examine the results of a method based on dependency-annotated corpora (treebanks), and show that extracting recurrent lexico-syntactic trees opens interesting perspectives in this domain: the extracted results are at once less noisy, more complete and better structured, making it possible to associate routines such as [il est {frappant/intéressant/important} de {constater/noter/observer/voir}] with recurrent rhetorical functions in the corpus. Some examples of these semantico-rhetorical routines involving verbs of observation are presented.
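A depth-1 approximation of recurrent lexico-syntactic tree extraction simply counts lexicalized dependency edges across a parsed corpus and keeps those that recur. The sketch below assumes a trivial edge-list input format and an illustrative frequency threshold; it is not the authors' tooling:

```python
from collections import Counter

def recurrent_edges(trees, min_freq=2):
    """Count lexicalized dependency edges (head lemma, relation, child lemma)
    across parsed sentences and keep the recurrent ones: a depth-1 stand-in
    for full recurrent lexico-syntactic tree mining."""
    counts = Counter()
    for tree in trees:                 # one sentence = a list of edges
        for head, rel, child in tree:
            counts[(head, rel, child)] += 1
    return {edge: n for edge, n in counts.items() if n >= min_freq}
```

The full method generalizes this to subtrees of arbitrary depth, which is what lets it capture multi-slot routines rather than single head-dependent pairs.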

    Des motifs séquentiels aux motifs hiérarchiques : l’apport des arbres lexico-syntaxiques récurrents pour le repérage des routines discursives

    Get PDF
    This article offers a theoretical and methodological reflection on the objects of extended phraseology, which studies prefabricated units of discourse beyond the criteria of fixedness. More precisely, we try to clarify the general concept of motif, as well as the more specific one of discursive routine. We then compare two methodological approaches for identifying routines in corpora: a hierarchical method based on the detection of Recurrent Lexico-syntactic Trees (RLT), and the classical sequential method of repeated segments (segments répétés), or n-grams. Through a corpus study, we show that the RLT method is genuinely useful for spotting routines and collocations, but that n-grams seem better suited, and simpler to implement, for fixed idioms and for syntactic constructions involving grammatical lexemes; the syntactic model underlying RLTs would require some adaptation to identify these latter cases.
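The classical repeated-segments (n-gram) baseline named above can be sketched in a few lines; the length bounds and frequency threshold are illustrative parameters:

```python
from collections import Counter

def repeated_segments(tokens, n_min=2, n_max=4, min_freq=2):
    """Classical 'segments répétés' extraction: every contiguous token
    sequence of length n_min..n_max occurring at least min_freq times."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}
```

Because it only sees contiguous surface sequences, this method misses routines whose slots are separated by intervening material, which is precisely where the tree-based approach has the advantage.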

    Le lexicoscope : un outil d’extraction des séquences phraséologiques basé sur des corpus arborés

    No full text
    This article presents the functionalities of the Lexicoscope, an architecture dedicated to exploring treebanks. After reviewing some similar tools, we show how the characterization of complex expressions (corresponding to syntactic trees) in concordance search and in the extraction of cooccurrent tables can prove very useful for the study of collocations and of lexico-syntactic combinatorics in general. We also describe a usage scenario that lets users explore the contexts of target expressions in fine detail without having to learn the formalism of the underlying query language, thanks to an original query-by-example mode.
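Extracting a table of syntactic cooccurrents for a pivot word over a parsed corpus can be sketched as follows. The edge-list input format is an assumption for the example, not the Lexicoscope's internal representation:

```python
from collections import Counter

def cooccurrents(parsed_sentences, pivot):
    """Syntactic cooccurrent table for a pivot lemma: lemmas linked to the
    pivot by a dependency relation, grouped by that relation."""
    table = Counter()
    for edges in parsed_sentences:     # one sentence = a list of edges
        for head, rel, child in edges:
            if head == pivot:
                table[(rel, child)] += 1
            elif child == pivot:
                table[(rel, head)] += 1
    return table
```

A production tool would additionally rank the cooccurrents with an association measure (e.g. log-likelihood) rather than raw counts.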

    Extraction automatique de correspondances lexicales : Ă©valuation d'indices et d'algorithmes

    No full text
    No abstract available.